Abstract

We present a 2D nearest-neighbor quantum archi- 
tecture for Shor's factoring algorithm in polyloga- 
rithmic depth. Our implementation uses parallel 
phase estimation, constant-depth fanout and tele- 
portation, and constant-depth carry-save modular 
addition. We derive asymptotic bounds on the cir- 
cuit depth and width of our architecture and pro- 
vide a comparison to all previous nearest-neighbor 
factoring implementations. 

Keywords: quantum architecture, prime factor- 
ization, Shor's algorithm, nearest-neighbor, carry- 
save addition 

1 Introduction 

Shor's factoring algorithm is a central result in 
quantum computing, with an exponential speed-up 
over the best-known classical algorithm [16]. As
the most notable example of a quantum-classical
complexity separation, much effort has been devoted
to implementations of factoring on a realistic
architectural model of a quantum computer
[TH [2(3 EH [22I . We can bridge the gap between 
the theoretical algorithm and a physical implemen- 
tation by describing the layout and interactions of 
qubits at an intermediate, architectural level of ab- 
straction. This gives us a model for measuring cir- 
cuit resources and their tradeoffs. In this work, we 
present a novel quantum architecture for prime in- 
teger factorization in two dimensions that allows 
concurrent (parallel) two-qubit operations between 
neighboring qubits. 

Our paper is organized as follows. Section [2]
introduces quantum architectural models, circuit
resources, and constant-depth communication
techniques due to [9] [13]. Section [3] places this work
in the context of existing results. In Section [4] we
provide a self-contained pedagogical review of the
carry-save technique and encoding. In Section [5] we
extend the carry-save technique to a 2D modular
adder, which we then use as a basis for a modular
multiplier (Section [6]) and a modular exponentiator
(Section [7]). Finally, we analyze the asymptotic
circuit resources required by our approach and
compare them to previous implementations in the
related work.

2 Background 

Quantum architecture is concerned with the phys- 
ical layout of qubits and constraints on their inter- 
actions, as well as the efficient execution, in time, 
space, and other resources, of algorithms on a given 
architecture. In this paper, we focus on design- 
ing a realistic nearest-neighbor circuit for running 
Shor's factoring algorithm on architectural models 
of a physical quantum device. 

2.1 Architectural Models and 
Circuit Resources 

Following Van Meter [21], we distinguish between 
a model and an architectural implementation as fol- 
lows. A model is a set of constraints and rules for the 
placement and interaction of qubits. An architecture, 
or implementation, is a particular spatial layout of 
qubits (as a graph of vertices) and their constrained 
interactions (edges between the vertices), following 
the constraints of a given model. 

The most general model is called Abstract Con- 
current (AC) and allows arbitrary, long-range in- 
teractions between any qubits and concurrent op- 
eration of quantum gates. This corresponds to a 
complete graph with an edge between every pair of 
nodes, which is the model assumed in most quan- 
tum algorithms. 

A more specialized model restricts interactions 
to nearest-neighbor, two-qubit, concurrent gates 
(NTC) in a regular one-dimensional chain (1D 
NTC), which is sometimes called linear nearest- 
neighbor (LNN). This corresponds to a line graph. 

To relieve movement congestion, we can extend 
to a two-dimensional regular grid (2D NTC), where 
each qubit has four neighbors and there is an extra
degree of freedom in which to move data. In this
paper, we extend the 2D NTC model in two
ways. The first extension allows arbitrary planar 
graphs with bounded degree, rather than a regu- 
lar square lattice. Namely, we assume qubits lie 
in a plane and edges are not allowed to intersect, 
so that theoretically all qubits are accessible from 
above or below by control and measurement ap- 
paratus. Whereas 2D NTC conventionally assumes 
each qubit has four neighbors, we consider up to six 
neighbors in a roughly hexagonal layout. The sec- 
ond extension we make is the realistic assumption 
that classical control can access every qubit in paral- 
lel, and we do not count these classical resources in 
our implementation. We call these augmented models
CCAC and CCNTC following [15]. The classical
controller corresponds to fast digital computers
which are available in actual experiments and are 
necessary for constant-depth communication in the 
next section. 

We measure the efficiency of a circuit on a par- 
ticular architecture in terms of three resources: cir- 
cuit size (number of non-identity gates), circuit depth 
(number of time-steps), and circuit width (number 
of qubits). For circuit depth, a two-qubit gate takes 
one time-step and absorbs any adjacent single-qubit 
gates. Multiple two-qubit gates on disjoint qubits 
can occur in parallel during the same timestep. 

2.2 Constant-depth Teleportation 
and Fanout 

Two key problems in nearest-neighbor architec- 
tures deal with communication, namely moving 
and copying quantum information. How can we 
transport quantum information at one site to an- 
other over arbitrarily long distances? To solve this 
problem, we employ the constant-depth teleporta- 
tion circuit shown in part (a) of Figure [1] using stan- 
dard quantum circuit notation from [12]. 

Figure 1: Constant-depth circuits based on [13] [4] for (a) teleportation [15] and (b) fanout [9].

The second problem is copying information.
Although general cloning is impossible [12], we only
need to perform unbounded quantum fanout, the
operation $|x, y_1, \ldots, y_n\rangle \to |x, y_1 \oplus x, \ldots, y_n \oplus x\rangle$.
This is used in our arithmetic circuits when a single
qubit needs to control (be entangled with) a large
quantum register (called a fanout rail). We employ
a constant-depth circuit due to insight from
measurement-based quantum computing [14] that 
relies on the creation of an n-qubit cat state M. It
requires $O(1)$ depth, $O(n)$ size, and $O(n)$ width, and
is shown in part (b) of Figure [1] for the case of
fanning out $|\psi\rangle$ to four qubits. The technique works by
creating multiple small cat states of a fixed size (in 
this case, three qubits) and linking them together 
with Bell measurements. The qubits marked \£) 
are entangled into a (slightly) larger cat state, up 
to Pauli corrections. 

$$\Big(\prod X_i^{k} Z_\ell^{j}\Big)\,\frac{1}{\sqrt{2}}\big(|0000\rangle + |1111\rangle\big) \qquad (1)$$

The operators $X_i^k$ and $Z_\ell^j$ denote Pauli X and Z
operators on qubits $i$ and $\ell$, controlled by classical bits $k$
and $j$, respectively. These corrections are enacted by
the classical controller based on the Bell measure- 
ment outcomes (not depicted). Unfortunately, this 
"consumes" the cat state in that there is no known 
way to unentangle the source qubit from the cat
state after they have been jointly measured [13].

3 Related Work 

We extend the body of work which applies classical
ideas to quantum logic. Gossett [8] uses carry-save
techniques to add numbers in constant depth and
multiply in logarithmic depth using a special
encoding, but at a quadratic cost in qubits (circuit
width). The underlying idea of encoded adding,
sometimes called a 3-2 adder, derives from Wallace
trees [24].

Choi and Van Meter are the first to discuss 2D
architectures by designing an adder that runs in
$\Theta(\sqrt{n})$ depth on 2D NTC [5] using $O(n)$ qubits
with dedicated, special-purpose areas of a physical
circuit layout.

Takahashi and Kunihiro have also discovered a
linear-depth and linear-size adder using zero
ancillae [17], and also an adder with variable
tradeoffs between $O(n/d(n))$ ancillae and $O(d(n))$ depth
for $d(n) = \Omega(\log n)$ [19], which has better width
but worse depth than our adder. This approach
assumes unbounded fanout, which has not been
mapped to a nearest-neighbor circuit until the
current work.

Once an adder implementation is chosen, it can 
be extended to perform modular reduction, modu- 
lar multiplication, modular exponentiation, and ul- 
timately quantum period finding (QPF), the only 
quantum part of the factoring algorithm. Since 
Shor's algorithm is a probabilistic algorithm, re- 
quiring several rounds of QPF to amplify success 
probability, it suffices to determine the resources re- 
quired for a single round of QPF with a fixed, mod- 
est success probability. The original approach to 
QPF performs controlled modular exponentiation 
followed by an inverse quantum Fourier transform 
(QFT) [12]. We will call this serial QPF. 

This is the approach taken by all other factoring
(QPF) implementations on any architectural model
before the current work. For example, Beauregard
[3] uses this QPF approach to construct a
cubic-depth quantum period-finder using only $2n + 3$
qubits on AC, by combining the ideas of Draper's
transform adder [6], Vedral et al.'s modular
arithmetic blocks [23], and a semi-classical QFT. This
approach was subsequently adapted to 1D NTC by
Fowler, Devitt, and Hollenberg [7] to achieve exact
resource counts for an $O(n^3)$-depth quantum
period-finder. Kutin [11] later improved this using
an idea from Zalka for approximate multipliers to
get a QPF circuit on 1D NTC in $O(n^2)$ depth. Thus,
there is only a constant overhead from Zalka's own
factoring implementation on AC, also in quadratic
depth [25]. Takahashi and Tani extend their earlier
$O(n)$-depth adder to a factoring circuit in $O(n^3)$
depth but with linear width.

All these works assume qubits are expensive 
(width) and that execution time (depth) is not the 
limiting constraint. We compare our work primar- 
ily against Kutin's method, and we make the alter- 
native assumption that ancillae are cheap and that 
fast classical control is available which can access 
all qubits in parallel. Therefore, we optimize circuit 
depth at the expense of width. 

Serial QPF is depth-limited by having to perform
an inverse QFT. On an AC architecture, even
when approximating the (inverse) QFT by truncating
two-qubit $\pi/2^k$ rotations beyond $k = O(\log n)$,
the depth is $O(n \log n)$ for factoring $n$-bit numbers.
There is an alternative, parallel version of phase
estimation described in Section 13 of [1], which
decreases depth in exchange for increased width and
additional classical post-processing. This eliminates
the need to do an inverse QFT. We refer the reader
to [1] and [13] for details. Our factoring scheme
employs our 2D quantum arithmetic circuits and this
parallel QPF, and we will show that it is asymptoti- 
cally more efficient than the other QPF method. We 
compare the circuit resources required by our work 
with the serial QPF implementations above in Table 
□ of Section^ 

Recent results by Browne, Kashefi, and Perdrix
(BKP) connect the power of measurement-based
quantum computing to the quantum circuit model
augmented with unbounded fanout [4]. Their
model, which we adapt and call CCNTC, uses the
classical controller mentioned in Section 2.2. They
describe a constant-depth circuit for exact factoring,
improving on a constant-depth circuit for approximate
factoring by Høyer and Špalek [10]. A direction for
future work is to determine how our approach
compares to the BKP result in terms of circuit size and
width.



4 The Constant-Depth 
Carry-Save Technique 

Our 2D factoring approach rests on the central
technique of the constant-depth carry-save adder (CSA)
[8], which converts the sum of three numbers a,
b, and c to the sum of two numbers u and v:
$a + b + c = u + v$. To explain this technique and
how it achieves constant depth, we need the
following definitions.

A conventional number x can be represented in n
bits as $x = \sum_{i=0}^{n-1} 2^i x_i$, where $x_i \in \{0, 1\}$ denotes
the $i$th bit of x, which we call an $i$-bit.^1
Equivalently, x can be represented as a (non-unique) sum
of two smaller conventional numbers, u and v. We
say $(u + v)$ is a carry-save encoded, or CSE, number.
The CSE representation itself consists of $2n - 2$
individual bits, where $v_0$ is always 0 by convention.

^1 It will be clear from the context whether we mean an $i$-bit,
which has significance $2^i$, or an $i$-bit number.

At the level of bits, a CSA converts the sum of
three $i$-bits into the sum of an $i$-bit (the sum bit) and
an $(i+1)$-bit (the carry bit): $a_i + b_i + c_i = u_i + v_{i+1}$.
By convention, the bit $u_i$ is the parity of the input
bits ($u_i = a_i \oplus b_i \oplus c_i$) and the bit $v_{i+1}$ is the majority
of $\{a_i, b_i, c_i\}$. See Figure [2] for a concrete example,
where $(u + v)$ has $2n - 2 = 8$ bits, not counting $v_0$.
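The bit-level rule can be checked classically. The following Python sketch is only an illustration (the function name `csa_bit` is ours, not from the source); it enumerates all eight input combinations and confirms that the parity and majority bits preserve the sum.

```python
from itertools import product


def csa_bit(a, b, c):
    """Return (u, v): u is the parity (an i-bit), v the majority (an (i+1)-bit)."""
    u = a ^ b ^ c
    v = (a & b) | (a & c) | (b & c)
    return u, v


for a, b, c in product((0, 1), repeat=3):
    u, v = csa_bit(a, b, c)
    # The carry bit v has twice the significance of the sum bit u.
    assert a + b + c == u + 2 * v
```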

It will also be useful to refer to a subset of the
bits in a conventional number using subscripts to
indicate a range of indices.

$$x_{(j,k)} = \sum_{i=j}^{k} 2^i x_i, \qquad x_{(i)} = x_{(i,i)} = 2^i x_i \qquad (2)$$

Using this notation, the following identity holds.

$$x_{(j,k)} = x_{(j,\ell)} + x_{(\ell+1,k)} \quad \text{for all } j \le \ell < k \qquad (3)$$

We can express the relationship between the bits of
x and (u + v) as follows.

$$x = x_{(0,n-1)} = u + v = u_{(0,n-2)} + v_{(1,n-1)} \qquad (4)$$

Finally, we will denote taking the modular residue
of a number as follows: $x[m] = x \bmod m$.



Using a Toffoli gate decomposition (see p. 182
of [12]), two control qubits and a single target qubit
must be mutually connected to each other. Given
this constraint, and the interaction of the CNOTs
in Figure [3], we can rearrange these qubits on a 2D
planar grid and obtain the layout shown in Figure
[4], which satisfies our 2D NTC model. Note that
this uses more gates and one more ancilla than the
equivalent quantum full adder circuit in Figure 5 of
[8], but this is necessary to meet our architectural
constraints and does not change the asymptotic
results. Also in Figure [4] is a variation called a 2-2
adder, which simply re-encodes two $i$-bits into an
$i$-bit and an $(i+1)$-bit, which will be useful in the
next section.



At the level of numbers, the sum of three n-bit
numbers can be converted into the sum of
two n-bit numbers by applying a CSA layer of n
parallel, single-bit CSAs. Since each CSA operates
in constant depth, the entire layer also operates
in constant depth, and we have achieved
(non-modular) addition.

An important consideration here is the circuit
width. The circuit above operates out-of-place and
produces two garbage qubits, the original inputs $b_i$
and $c_i$. A single addition of three n-bit numbers
requires $O(n)$ circuit width.
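A CSA layer can be mimicked classically with bitwise operations, since every bit position is processed independently. The sketch below is an illustration under that assumption (the name `csa_layer` is ours); Python integers stand in for the qubit registers.

```python
def csa_layer(a, b, c):
    """Apply single-bit CSAs to every bit position of three integers at once."""
    u = a ^ b ^ c                            # parity bits, same significance
    v = ((a & b) | (a & c) | (b & c)) << 1   # majority bits, shifted up by one
    return u, v


a, b, c = 0b1011, 0b0110, 0b1101
u, v = csa_layer(a, b, c)
assert u + v == a + b + c   # the CSE pair encodes the same sum
assert v & 1 == 0           # v_0 is always 0 in the CSE convention
```

No carry propagates between positions, which is why the layer has constant depth regardless of n.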

5 Quantum Modular Addition 

To perform addition of two numbers a and b modulo
m, we consider the variant problem of modular
addition of three numbers to two numbers: Given
three n-bit input numbers a, b, and c and an n-bit
modulus m, compute the following: $(u + v) =
(a + b + c)[m]$, where $(u + v)$ is a CSE number.

In this section, we provide an alternative,
pedagogical explanation of Gossett's modular reduction
[8]. Later, we contribute a mapping to a 2D
architecture, using unbounded fanout to maintain
constant depth for adding back modular residues. This last
step is missing in Gossett's original approach.

To start, we will demonstrate the basic method
of modular addition and reduction on an n-bit
conventional number. In general, adding two n-bit
conventional numbers will produce an overflow $n$-bit,
which we can truncate as long as we add back its
modular residue $2^n \bmod m$. How can we guarantee
that we won't generate another overflow bit by
adding back the modular residue? It turns out we
can accomplish this by allowing a slightly larger
input and output number ($n + 1$ bits in this case),
truncating multiple overflow bits, and adding back
their modular residues.

Figure 2: An example of carry-save encoding for the 5-bit conventional number 30.

Figure 3: Carry-save adder circuit for a single bit position i: $a_i + b_i + c_i = u_i + v_{i+1}$.

Figure 4: The carry-save adder (CSA), or 3-2 adder,
and carry-save 2-2 adder.

For an $(n+1)$-bit conventional number x, we
truncate its high-order bits $x_n$ and $x_{n-1}$ and add
back their modular residue, $x_{(n-1,n)}[m]$.

$$x \bmod m = x_{(0,n)}[m] = x_{(0,n-2)} + x_{(n-1,n)}[m] \qquad (5)$$

Since both the truncated number $x_{(0,n-2)}$ and the
modular residue are $n$-bit numbers, their sum is an
$(n+1)$-bit number as desired, equivalent to $x[m]$.
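The truncate-and-add-back step on a conventional number can be sketched classically as follows. This is an illustration only: the name `reduce_once` is ours, and computing `high % m` directly stands in for selecting one of the classically precomputed residues based on the two truncated bits.

```python
def reduce_once(x, m, n):
    """Truncate bits n-1 and n of an (n+1)-bit x and add back their residue mod m."""
    assert 0 < m < (1 << n) and 0 <= x < (1 << (n + 1))
    low = x & ((1 << (n - 1)) - 1)   # x_(0,n-2)
    high = x - low                   # x_(n-1,n), with bit weights included
    return low + (high % m)          # congruent to x mod m


n, m = 5, 23
for x in range(1 << (n + 1)):
    y = reduce_once(x, m, n)
    assert y % m == x % m        # same residue class
    assert y < (1 << (n + 1))    # still fits in n+1 bits, so no new overflow
```

The result is congruent to x modulo m but is not necessarily fully reduced, which matches the role of Equation (5) as a size-preserving step.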

Now we must do the same modular reduction
on a CSE number $(u + v)$, which represents
an $(n+1)$-bit conventional number and has $2n$
bits. First, we truncate the three high-order bits
$(v_n, u_{n-1}, v_{n-1})$ of $(u + v)$, yielding an $n$-bit
conventional number with a CSE representation of
$2n - 3$ bits: $\{u_0, u_1, \ldots, u_{n-2}\} \cup \{v_1, v_2, \ldots, v_{n-2}\}$.
Then we add back the three modular residues
$(v_{(n)}[m], u_{(n-1)}[m], v_{(n-1)}[m])$, and we are
guaranteed not to get more overflow bits (of significance



2 n 1 or higher). This equivalence is shown in Equa- 
tion [6] 

$$(u + v)[m] = \big(u_{(0,n-1)} + v_{(1,n)}\big)[m]
= u_{(0,n-2)} + v_{(1,n-2)} + u_{(n-1)}[m] + v_{(n-1)}[m] + v_{(n)}[m] \qquad (6)$$

Lemma 1 (Modular Reduction in Constant Depth
[8]). The modular addition of three n-bit numbers to two
n-bit numbers can be accomplished in constant depth.

Proof. Our goal is to show how to perform modular
addition while keeping our numbers of a fixed size
by treating overflow bits correctly. First, we enlarge
our registers to allow the addition of $(n+2)$-bit
numbers, while keeping our modulus of size $n$ bits.
(In Gossett's original approach, he takes the
equivalent step of restricting the modulus to be of size
$(n-2)$ bits.) We accomplish the modular addition
by first performing a layer of non-modular addition,
truncating the three high-order overflow bits, and
then adding back modular residues controlled on
these bits in three successive layers, where we are
guaranteed that no additional overflow bits are
generated. This is illustrated for a 3-bit modulus, and
5-bit registers, in Figure [5].



We use the following notation. The non-modular
sum of the first layer is $u$ and $v$. The CSE output
of the first modular reduction layer is $u'$ and $v'$,
and the modular residue is written as $c^{v_{n+1}}$ to mean
the precomputed value $2^{n+1} \bmod m$ controlled on
$v_{n+1}$. The CSE output of the second modular
reduction layer is $u''$ and $v''$, and the modular residue
is written as $c^{u_{n+1}}$ to mean the precomputed value
$2^{n+1} \bmod m$ controlled on $u_{n+1}$. The CSE output of
the third and final modular reduction layer is $u'''$
and $v'''$, and the modular residue is written as $c^{v_{n+2}}$
to mean the precomputed value $2^{n+2} \bmod m$
controlled on $v_{n+2}$.

We show that at no layer is an overflow $(n+2)$-bit
generated, namely in the v component of any
CSE output. (The u component will never exceed
the size of the input numbers.) First, we know that
no v' n+1 bit is generated after the first modular re- 
duction layer, because we have truncated away all 
(ft + l)-bits. Second, we know that no , 2 bit is 
generated because we only have one (ft + l)-bit to 
add, 

V-n+v Fixity/ we need to show a sufficient con- 
dition for no being generated in the third 
modular reduction layer. This bit is the majority 
of v'n+y and = 0. This means we only 
have to guarantee that at most one of and 
has value 1. This is equivalent to requiring that 
u'f , - \+v','i, < 3 • 2 n+1 , that is, the sum of these 

three bits has value at most 3. Bit is copied 
directly from i/ +1 by the rules of CSA, which im- 
plies the following condition for the second mod- 
ular reduction layer: u', s + v', , 1A < 3 • 2 n . This 

J (n) [n,n-\-l) — 

is true because u'^ + ^/ n+1 ) = M( n ) + V( n ) — 2 and 
v f ,j < 1. Everywhere we use the fact that the mod- 
ular residues are restricted to n bits. Therefore, 
the modular sum is computed as the sum of two
$(n+2)$-bit numbers with no overflows in constant
depth. □

As a side note, we can perform modular reduction
in one layer instead of three by decoding the
three overflow bits into one of seven different
modular residues. This can also be done in constant
depth, and in this case we only need to enlarge all
our registers to $(n+1)$ bits instead of $(n+2)$ as in
the proof above. However, we omit this proof here
for simplicity.

To summarize, the circuit resources for modular
addition are $O(1)$ depth and $O(n)$ width.

5.1 A Concrete Example of 
Modular Addition 



A 2D circuit for modular addition of 5-bit numbers
using four layers of parallel CSAs is shown
graphically in Figure [6], which corresponds directly
to the schematic proof in Figure [5]. Figure [6] also
represents the approximate physical layout of the
qubits as they would look if this circuit were to be
fabricated. Here, we convert the sum of three 5-bit
integers into the modular sum of two 5-bit integers,
with a 3-bit modulus m. In the first layer, we
perform 4 CSAs in parallel on the input numbers
$(a, b, c)$ and produce the output numbers $(u, v)$.

As described above, we truncate the three high-order
bits during the initial CSA round (bits
$u_4$, $v_4$, $v_5$) to retain a 4-bit number. Each of these bits
serves as a control for adding its modular residue
to a running total. We can classically precompute
$2^4[m]$ for the two additions controlled on $u_4$ and $v_4$
and $2^5[m]$ for the addition controlled on $v_5$.

In layer 2, we use a constant-depth fanout rail (see
Figure [1]) to distribute the control bit $v_4$ to its
modular residue, which we denote as $|c^{v_4}\rangle = |2^4[m] \cdot v_4\rangle$.
$c^{v_4}$ has $n$ bits, which we add to the CSE results of
layer 1. The results $u'_i$ and $v'_{i+1}$ are teleported into
layer 3. The exception is $v'_4$, which is teleported into
layer 4, since there are no other 4-bits to which it
can be added. Wherever there are only two bits of
the same significance, we use the 2-2 adder from Figure [4].

Layer 3 operates similarly to layer 2, except that
the modular residue is controlled on $u_4$: $|c^{u_4}\rangle =
|2^4[m] \cdot u_4\rangle$. $c^{u_4}$ has 3 bits, which we add to the CSE
results of layer 2, where $u''_i$ and $v''_{i+1}$ are teleported
forward into layer 4.
Figure 5: A schematic proof of Gossett's constant-depth modular reduction for n = 3



Layer 4 is similar to layers 2 and 3, with the
modular residue controlled on $v_5$: $|c^{v_5}\rangle = |2^5[m] \cdot v_5\rangle$.
$c^{v_5}$ has 3 bits, which we add to the CSE results of
layer 3. There is no overflow bit and no carry 
bit from v% and v'^ as argued in Lemma 1. The final 
modular sum $(a + b + c)[m]$ is $u''' + v'''$.
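The four-layer procedure of this example can be simulated classically. The following Python sketch is an illustration under our own assumptions: integers stand in for qubit registers, the helper names `csa` and `modular_csa_add` are ours, and fanout and teleportation are not modeled. It performs one non-modular CSA layer, truncates the three high-order bits, and adds back their precomputed residues in three further CSA layers.

```python
from itertools import product


def csa(a, b, c):
    """One carry-save layer: three numbers in, a (u, v) pair with the same sum out."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def modular_csa_add(a, b, c, m, n):
    """Reduce a + b + c to a CSE pair (u, v) of (n+2)-bit numbers, congruent mod m."""
    assert 0 < m < (1 << n)
    hi = n + 1                           # position of bits u_{n+1} and v_{n+1}
    u, v = csa(a, b, c)                  # layer 1: non-modular addition
    u_hi = (u >> hi) & 1                 # truncated bit u_{n+1}
    v_hi = (v >> hi) & 1                 # truncated bit v_{n+1}
    v_top = (v >> (hi + 1)) & 1          # truncated bit v_{n+2}
    low_mask = (1 << hi) - 1
    u, v = u & low_mask, v & low_mask
    r_hi = pow(2, hi, m)                 # precomputed 2^(n+1) mod m
    r_top = pow(2, hi + 1, m)            # precomputed 2^(n+2) mod m
    # Layers 2-4: add back one controlled residue per layer.
    for residue in (r_hi * v_hi, r_hi * u_hi, r_top * v_top):
        u, v = csa(u, v, residue)
    return u, v


n, m = 3, 7
for a, b, c in product(range(1 << (n + 2)), repeat=3):
    u, v = modular_csa_add(a, b, c, m, n)
    assert (u + v) % m == (a + b + c) % m
    assert u < (1 << (n + 2)) and v < (1 << (n + 2))   # no overflow bit
```

The exhaustive loop covers all 5-bit inputs for a 3-bit modulus, which is the case drawn in the figures; it checks the outcome of the lemma for that case rather than reproducing its proof.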



6 Quantum Modular 
Multiplication 

We can build upon our carry-save adder to implement
quantum modular multiplication in logarithmic
depth. We start with a completely classical
problem to illustrate the principle of multiplication
by repeated addition. Then we consider modular
multiplication of two quantum numbers in a serial
and a parallel fashion in Section 6.1. Both of these
problems use as a subroutine the generic problem of
modular multiple addition, which we define and solve
in Section 6.2.



First we consider a completely classical problem:
given three n-bit classical numbers a, b, and m,
compute $c = ab \bmod m$, where c is allowed to be in CSE.

We only have to add shifted multiples of a to itself,
"controlled" on the bits of b. There are n shifted
multiples of a, let's call them $z^{(i)}$, one for every bit
of b: $z^{(i)} = 2^i a b_i \bmod m$. We can parallelize the
addition of n numbers in a logarithmic-depth binary
tree to get a total depth of $O(\log n)$.
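The classical version of this idea can be sketched in a few lines of Python. The helper names `shifted_multiples` and `tree_sum` are ours, and ordinary integer addition stands in for the constant-depth modular adders, so the level count only illustrates the logarithmic tree structure.

```python
def shifted_multiples(a, b, m, n):
    """z^(i) = (2^i * a * b_i) mod m for each bit b_i of the n-bit number b."""
    return [(pow(2, i, m) * a * ((b >> i) & 1)) % m for i in range(n)]


def tree_sum(values):
    """Add values pairwise in a binary tree; return (total, number of levels)."""
    levels = 0
    while len(values) > 1:
        values = [sum(values[k:k + 2]) for k in range(0, len(values), 2)]
        levels += 1
    return values[0], levels


a, b, m, n = 11, 13, 15, 4
total, levels = tree_sum(shifted_multiples(a, b, m, n))
assert total % m == (a * b) % m
assert levels == 2   # two levels suffice for n = 4 summands
```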

6.1 Modular Multiplication of 
Two Quantum Numbers 

We now consider the problem of multiplying a
classical number controlled on a quantum bit and a
quantum number,^2 which is a quantum superposition
of classical numbers: given an n-qubit quantum
number $|x\rangle$, a control qubit $|p\rangle$, and two n-bit
classical numbers a and m, compute $|c\rangle = |xa[m]\rangle$, where
c is allowed to be in CSE. This problem occurs
naturally in modular exponentiation (described in the
next section) and can be considered serial
multiplication, in that t quantum numbers are multiplied in
series to a single quantum register.

^2 In this paper, quantum numbers often result by entangling a
classical number in one register with a quantum control bit. This
should not be confused with the physics meaning of a quantum
number.

Figure 6: Addition and three rounds of modular reduction for a 3-bit modulus.



We first create n quantum numbers $|z^{(i)}\rangle$, which
are shifted multiples of the classical number a
controlled on the bits of x: $|z^{(i)}\rangle = |2^i a[m] \cdot x_i\rangle$. How
do we create these numbers, and what is the depth
of the procedure? First, note that $|2^i a[m]\rangle$ is a
classical number, so we can precompute this classically
and prepare them in parallel using single-qubit
operations on n registers, each consisting of n ancilla
qubits. Each n-qubit register will hold a future
$|z^{(i)}\rangle$ value. We then copy all n bits of x, n times
each, using an unbounded fanout operation so that
n copies of each bit $|x_i\rangle$ are next to register $|z^{(i)}\rangle$. This
takes a total of $O(n^2)$ parallel CNOT operations.
We then entangle each $|z^{(i)}\rangle$ with the corresponding
$x_i$. The schematic for this is shown in Figure [7],
not showing how we interleave these numbers into
groups of three using constant-depth teleportation.
This reduces to the task of modular multiple
addition, in order to add these numbers down to a single
number modulo m, which is described in Section 6.2.



Finally, we tackle the most interesting problem:
given two n-qubit quantum numbers $|x\rangle$ and $|y\rangle$
and an n-bit classical number m, compute $|c\rangle =
|xy \bmod m\rangle$, where $|c\rangle$ is allowed to be in CSE. This
can be considered parallel multiplication and is
responsible for our logarithmic speedup in modular
exponentiation.

Figure 7: Creating n = 4 shifted values $\{z^{(0)}, z^{(1)}, z^{(2)}, z^{(3)}\}$ for an input number x.

Instead of creating n quantum numbers, we
must create $n^2$ numbers $|z^{(i,j)}\rangle$ for all possible pairs of
quantum bits $x_i$ and $y_j$, $i, j \in \{0, \ldots, n-1\}$: $|z^{(i,j)}\rangle =
|2^i 2^j [m] \cdot x_i \cdot y_j\rangle$. We create these numbers using a
similar procedure to the previous problem. Adding
$n^2$ quantum numbers of n qubits each takes depth
$O(\log(n^2))$, which is still $O(\log n)$. Creating $n^2$
n-bit quantum numbers takes width $O(n^3)$.
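On classical basis states, the $n^2$ partial products can be written out directly. The sketch below is an illustration only (the name `partial_products` is ours); it checks that their sum is congruent to $xy$ modulo m, which is the quantity the modular multiple addition circuit adds down.

```python
def partial_products(x, y, m, n):
    """All n^2 values z^(i,j) = (2^i * 2^j mod m) * x_i * y_j."""
    return [pow(2, i + j, m) * ((x >> i) & 1) * ((y >> j) & 1)
            for i in range(n) for j in range(n)]


x, y, m, n = 13, 7, 15, 4
z = partial_products(x, y, m, n)
assert len(z) == n * n                  # n^2 numbers, each below m
assert sum(z) % m == (x * y) % m
```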

6.2 Modular Multiple Addition 

As a subroutine to modular multiplication, we de- 
fine the operation of repeatedly adding multiple 
numbers down to a single CSE number, called mod- 
ular multiple addition. 

The modular multiple addition circuit generically
adds down $t$ $n$-bit conventional numbers to an
$n$-bit CSE number.

$$z^{(0)} + z^{(1)} + \cdots + z^{(t-1)} = (u + v)[m] \qquad (7)$$



It does not matter how the t numbers are generated,
as long as they are divided into groups of three and
have their bits interleaved to be the inputs of a CSA
tile. In the cases above, serial multiplication results
in $t = n$ and parallel multiplication results in $t =
n^2$. At the beginning of the circuit, all CSA tiles
are active in that they have tile input numbers $z^{(i)}$ to
multiply, and their tile outputs will affect the overall
circuit output, $u + v$.

As the circuit proceeds through a number of
timesteps, tiles will become inactive when they do
not receive new numbers for their tile inputs; at that
point, their tile outputs can no longer affect the
circuit output. Since the CSA tile is a 3-2 adder, one
can see that if there are $t$ CSA tiles active at the
beginning of a timestep, there are $\lceil 2t/3 \rceil$ active tiles
at the end of the timestep, since there are roughly
two-thirds as many input numbers left to add down
to the circuit output $u + v$. One can see that the total
number of timesteps is therefore $\lceil \log_{3/2}(t/3) \rceil + 1$.
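The timestep count can be compared against a simple count of 3-to-2 compression rounds. The sketch below is an illustration under our own assumption of greedy grouping (the function names are ours); it does not model the tile schedule, and the two counts are not claimed to agree for every t.

```python
import math


def reduction_rounds(t):
    """Count rounds of 3-to-2 compression needed to go from t numbers to 2."""
    rounds = 0
    while t > 2:
        t = 2 * (t // 3) + (t % 3)   # each full group of three becomes two
        rounds += 1
    return rounds


def formula_timesteps(t):
    """The timestep count quoted in the text: ceil(log_{3/2}(t/3)) + 1."""
    return math.ceil(math.log(t / 3, 1.5)) + 1


# For t = 18 input numbers, both counts give six timesteps.
assert reduction_rounds(18) == 6
assert formula_timesteps(18) == 6
```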

To facilitate the below discussion, we will assign 
colors to each CSA tile, which are updated during 
the circuit execution. Active tiles can either be black 
or gray. A black tile will keep its two output num- 
bers as inputs and receive a third input number. An 
exception is the rightmost black tile may teleport 
one of its output numbers to its left black nearest 
neighbor and receive two input numbers from its 
right gray nearest neighbor. A gray tile will tele- 
port one of its output numbers to the nearest active 
tile to its left and the other output number to the 
nearest active tile to its right. An exception is the 
rightmost gray tile may teleport both output num- 
bers to its left black nearest neighbor. We can think 
of inactive tiles as white tiles in that they "fade" out 
of the circuit, and numbers get teleported through 
them without stopping to be added. The symbols 
for these colors are shown in Figure [8] 

The rules for updating tiles at the end of each 
timestep are as follows: 

• Black tiles are always active for the next 
timestep, but change colors as follows. 

- The leftmost tile always stays black. 

- If a black tile has a gray tile as its 
nearest active right neighbor in the cur- 
rent timestep, it stays black in the next 
timestep. 

- If a black tile has a black tile as its nearest
active neighbor either to the right or the
left, and it is not the leftmost tile, it turns
gray in the next timestep.

• Gray tiles always turn white (inactive) in the
next timestep.

Figure 8: From left to right, the symbols for a black, gray, and white tile, respectively

The initial state of the tile colors depends on its
index $i \in \{0, 1, \ldots, q - 1\}$ within $q = \lceil t/3 \rceil$ tiles.

• If i mod 3 = 0, then it starts out black. 

• If i mod 3 = 1, then it starts out gray. 

• If i mod 3 = 2, then it starts out black. 

Given the rules above, one can see that the left- 
most tile stays black throughout the entire circuit, 
and holds the final output number (u + v) at the 
end. 

Each timestep of the circuit consists of the follow- 
ing operations: 

1. All active CSA tiles will execute in parallel to 
transform their three input numbers into two 
output numbers (a CSE number). 

2. Gray tiles teleport their output numbers to the 
left and to the right to their black tile neigh- 
bors. The exception is the rightmost gray tile 
will teleport both of its output numbers to its 
left black tile nearest neighbor. 

3. Tile colors will change according to the rules 
above. Approximately two-thirds of the tiles 
will become inactive in the next timestep. 

4. Go back to Step 1 for the next timestep. 



These steps and the above tile color rules are best 
illustrated with a concrete example. In Figure [9] we
see the circuit for modular multiple addition as a se- 
ries of snapshots, separated by heavy dotted lines, 
with the passage of time going downward. The tiles 
change color over time, and the arrows indicate the 
teleportation of output numbers to neighboring ac- 
tive tiles in each timestep. In the initial timestep, the 
tiles are numbered to show how they are assigned 
their initial color. Between Timestep 0 and Timestep
1, all $\lceil n/3 \rceil$ CSA tiles are active. After each
succeeding timestep, roughly two-thirds as many CSA
tiles are active, until the very end, when only one
CSA tile is active. By
the convention established above, we teleport the 
rightmost output numbers to the left, so that the fi- 
nal output is read out from the leftmost CSA tile. 

Now we can analyze the circuit resources for 
multiplying $n$-bit quantum numbers, which requires 
$(t - 2)$ modular additions, for $t = n^2$. The 
circuit width is the sum of the $O(n^3)$ ancillae needed 
for number generation and the ancillae required for 
$O(n^2)$ modular additions. Each modular addition 
has width $O(n)$ and depth $O(1)$ from the previous 
section. There are $\lceil \log_{3/2}(n^2/3) \rceil + 1$ timesteps 
of modular addition. Therefore the entire modular 
multiplier circuit has depth $O(\log n)$ and width 
$O(n^3)$. 
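
The step from the timestep count to logarithmic depth can be made explicit. The following is a short worked bound, assuming $t = n^2$ as stated above and using only $\lceil x \rceil \le x + 1$:

```latex
\[
  \left\lceil \log_{3/2}\!\left(\frac{n^2}{3}\right) \right\rceil + 1
  \;\le\; 2\log_{3/2} n \;-\; \log_{3/2} 3 \;+\; 2
  \;\le\; 2\log_{3/2} n
  \;=\; O(\log n).
\]
```

Since each timestep is one modular addition of depth $O(1)$, the multiplier depth is $O(\log n)$.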

Figure 9: Modular multiple addition of quantum numbers on a CSA tile architecture for $t = 18$ with depth $(\lceil \log_{3/2}(t/3) \rceil + 1) = 6$ timesteps. 

7 Quantum Modular Exponentiation 

We now extend our arithmetic to modular 
exponentiation, which is repeated modular 
multiplication controlled on qubits supplied by a 
phase estimation procedure. If we wish to multiply 
an $n$-qubit quantum input number $|x\rangle$ by $t$ 
classical numbers, we can multiply them in series. 
This requires depth $O(t \log n)$ based on the 
modular multipliers in previous sections. 

Now consider the same procedure, but this time 
each classical number is controlled on a quantum 
bit $p_j$. This is a special case of multiplying 
by $t$ quantum numbers in series, since a classical 
number entangled with a quantum number is also 
quantum. It takes the same depth $O(t \log n)$ as the 
previous case. 

Finally, we consider multiplying $t$ quantum 
numbers $\{x^{(1)}, x^{(2)}, \ldots, x^{(t)}\}$ in a parallel, 
logarithmic-depth binary tree. This is shown in 
Figure 10. The tree has depth $\log_2(t)$ in modular 
multiplier operations. Furthermore, each modular 
multiplier operation has depth $O(\log n)$ for 
$n$-qubit numbers. 



Therefore, the overall depth of this parallel modular 
exponentiation structure is $O(\log(t) \log(n))$. In 
phase estimation for QPF, it is sufficient to take 
$t = O(n)$ [12] [7]. Therefore our total depth is 
$O(\log^2(n))$ as desired. At this point, combined with 
the parallel phase estimation procedure of Q, we 
have a complete factoring implementation in our 2D 
nearest-neighbor architecture. 
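
The tree structure can be illustrated with a classical analogue. The sketch below is only an illustration under stated assumptions: it multiplies ordinary integers modulo N in pairwise rounds and counts the rounds, which play the role of tree depth in modular multiplier operations. The function name `tree_modmul`, the choice of factors a^(2^j) selected by control bits p_j, and the small values a = 7, N = 15 are hypothetical and are not taken from the paper.

```python
def tree_modmul(factors, modulus):
    """Multiply a non-empty list of ints mod `modulus` in pairwise rounds.

    Returns (product, rounds), where `rounds` is the depth of the binary
    tree counted in multiplications.
    """
    level = [f % modulus for f in factors]
    rounds = 0
    while len(level) > 1:
        # Multiply neighbours pairwise; all products in a round are independent.
        nxt = [(level[i] * level[i + 1]) % modulus
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an unpaired last element moves up unchanged
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds


# Illustrative values only (assumptions, not from the paper).
a, N = 7, 15
p = [1, 0, 1, 1]                    # control bits p_j, least significant first

# Factor j is a^(2^j) mod N when p_j = 1, and 1 otherwise.
factors = [pow(a, 2 ** j, N) if bit else 1 for j, bit in enumerate(p)]
result, rounds = tree_modmul(factors, N)

exponent = sum(bit << j for j, bit in enumerate(p))
print(result, rounds, result == pow(a, exponent, N))
```

With t = 4 factors the tree needs 2 rounds, against 3 multiplications in series; the gap widens to log_2(t) versus t - 1 as t grows.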



For all known QPF procedures, there are $t = O(n)$ 
control bits needed, and also $O(n)$ modular 
multiplications in a tree of depth $O(\log n)$. Each 
modular multiplication has depth $O(\log n)$ and 
width $O(n^3)$. Therefore, the depth of the parallel 
modular exponentiation circuit above is $O(\log^2 n)$ 
and the width is $O(n^4)$. 
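
Written out as products, the bounds combine as follows. The symbols $D_{\mathrm{exp}}$, $W_{\mathrm{exp}}$, $D_{\mathrm{mult}}$ and $W_{\mathrm{mult}}$ are introduced here only for illustration:

```latex
\[
  D_{\mathrm{exp}}
    = \underbrace{O(\log n)}_{\text{tree depth}}
      \cdot \underbrace{O(\log n)}_{D_{\mathrm{mult}}}
    = O(\log^2 n),
  \qquad
  W_{\mathrm{exp}}
    = \underbrace{O(n)}_{\text{multipliers}}
      \cdot \underbrace{O(n^3)}_{W_{\mathrm{mult}}}
    = O(n^4).
\]
```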

Figure 10: Parallel modular exponentiation: multiplying $t$ quantum numbers, each of $n$ qubits, in an $O(\log(t) \log(n))$-depth binary tree. 



Implementation                     Architecture   Depth             Width 
Vedral, Barenco & Ekert [23]       AC             $O(n^3)$          $O(n)$ 
Gossett                            AC             $O(n \log n)$     $O(n^2)$ 
Beauregard [2]                     AC             $O(n^3)$          $O(n)$ 
Zalka [25]                         AC             $O(n^2)$          $O(n)$ 
Takahashi & Kunihiro [18]          AC             $O(n^3)$          $O(n)$ 
Fowler, Devitt, Hollenberg [7]     1D NTC         $O(n^4)$          $O(n^3)$ 
Kutin [11]                         1D NTC         $O(n^2)$          $O(n)$ 
Current Work                       2D NTC         $O(\log^2 n)$     $O(n^4)$ 

Table 1: Asymptotic resource usage for quantum factoring of an $n$-bit number. 



8 Results 

The resources required for our approach, as well 
as other nearest-neighbor approaches, are listed in 
Table 1, where the asymptotic resource bounds 
assume some fixed constant error probability for each 
round of period-finding. We achieve an exponential 
improvement in nearest-neighbor circuit depth 
(from quadratic to polylogarithmic) with our 
approach at the cost of a polynomial increase in 
circuit width. Similar depth improvements at the cost 
of width increases can be achieved using the 
modular multipliers of other factoring implementations 
by arranging them in a parallel, KSV-style modular 
exponentiator. 



9 Conclusions and Future Work 

In this paper, we have presented a 2D architecture 
for factoring on a quantum computer. Using a 
combination of algorithmic improvements 
(carry-save adders and parallelized phase estimation) 
and architectural improvements (irregular 
two-dimensional layouts and constant-depth 
communication), we conclude that we can run the 
central part of Shor's factoring algorithm (quantum 
period-finding) with asymptotically smaller depth 
than previous implementations. 

For future work, we would like to determine the 
exact width, depth, and size of our proposed 
factoring circuit, including the constants, as well as 
to further optimize our depth to be constant. Along 
those lines, Rosenbaum recently showed how to 
convert any $n$-qubit CCAC circuit to an equivalent 
CCNTC circuit in constant depth using $n^2$ 
ancillae [15]. It is an interesting open question how a 
generic conversion of a constant-depth CCAC 
factoring architecture [9] [10] to CCNTC compares to 
our hand-optimized circuit. The depth of our circuit 
may also be improved by extending the carry-save 
adder from a $3 \to 2$ circuit to any $2^n - 1 \to n$ circuit. 